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Abstract. The Data Mining process enables the end users to analyse, 
understand and use the extracted knowledge in an intelligent system or 
to support in the decision-making processes. However, many algorithms 
used in the process encounter large quantities ol patterns, complicating 
the analysis of the patterns. This fact occurs with association rules, a 
Data Mining technique that tries to identify intrinsic patterns in large 
data sets. A method that can help the analysis of the association rules 
is the use of taxonomies in the step of post-processing knowledge. In 
this paper, the QA1ZT algorithm is proposed, which uses taxonomies to 
generalize association rules, and the RulEE-GAR computational module, 
that enables the analysis of the generalized rules. 



1 Introduction 

The development of the data storing technologies has increased the data storage 
capacity of companies. Nowadays the companies have technology to store de- 
tailed information about each performed transaction, generating large databases. 
This stored information may help the companies to improve themselves and be- 
cause of this the companies have sponsored researches and the development of 
tools to analyse the databases and generate useful information. 

During years, manual methods had been used to convert data in knowledge. 
However, the use of these methods has become expensive, time consuming, sub- 
jective and non-viable when applied at large databases. 

The problems with the manual methods stimulated the development of pro- 
cesses of automatic analysis, like the process of Knowledge Discovery in Databases 
or Data Mining. This process is defined as a process of identifying valid, novel, 
potentially useful, and ultimately understandable patterns in data [6]. 

In the Data Mining process, the use of the association rules technique may 
generate large quantities of patterns. This technique has caught the attention 
of companies and research centers [3]. Several researches have been developed 
with this technique and the results are used by the companies to improve their 



businesses (insurance policy, health policy, geo-processing, molecular biology) [8, 
4,9]. 

A way to solve the problem of the large quantities of patterns extracted by the 
association rules technique is the use of taxonomies in the step of post-processing 
knowledge [1, 8, 10]. The taxonomies may be used to prune uninteresting and/or 
redundant rules (patterns) [1]. 

In this paper the QAVff algorithm and the RulEE-GAR computational mo- 
dule is proposed. The QA1ZT algorithm (Generalization of Association Rules 
using Taxonomies) uses taxonomies to generalize association rules. The RulEE- 
GAR computational module uses the QART algorithm, to generalize association 
rules, and provides several means to analyze the generalized rules. 

This paper is organized as following: first by presenting the association rules 
technique and some general features about the use of taxonomies, second by 
describing the QAR.T algorithm and the RulEE-GAR computational module. 
Finally the results of some experiments performed with the QA1ZT algorithm 
along with our conclusion are presented. 

2 Association Rules and Taxonomies 

An association rule LHS => RHS represents a relationship between the sets 
of items LHS and RHS [2] . Each item / is an atom representing the presence 
of a particular object. The relation is characterized by two measures: support 
and confidence. The support of a rule R within a dataset D, where D itself 
is a collection of sets of items (or itemsets), is the number of transaction in 
D that contain all the elements in LHS U RHS. The confidence of the rule is 
the proportion of transactions that contain LHS U RHS with respect to the 
number of transactions that contain LHS. The problem of mining association 
rules is to generate all association rules that have support and confidence greater 
than the minimum support and minimum confidence defined by the user to mine 
association rules. High values of minimum support and minimum confidence just 
generate trivial rules. Low values of minimum support and minimum confidence 
generate large quantities of rules (patterns), complicating the user's analysis. 

A way of overcoming the difficulties in the analysis of large quantities of 
association rules is the use of taxonomies in the step of post-processing know- 
ledge. The use of taxonomies may help the user to identify interesting and useful 
knowledge in the extracted rules set. The taxonomies represent a collective or 
individual characterization of how the items can be classified hierarchically [1] . 
In Fig. 1 an example of a taxonomy is presented where it can be observed that: 
t-shirts are light clothes, shorts are light clothes, light clothes are a kind of sport 
clothes, sandals are a kind of shoes. 

In the literature there are several algorithms to generate association rules 
using taxonomies (generalized association rules). Algorithms like Cumulate and 
Stratify [10] generate rules sets larger than rules sets generated without taxo- 
nomies (because they generate association rules with and without taxonomies). 
To try decrease the quantity of generated rules, a subjective measure is used to 



prune the uninteresting rules [10]. The subjective measure does not guarantee 
that the quantity of rules will decrease. Our method proposes an algorithm and a 
module of post-processing [5]. Using the module, the user looks to a small set of 
rules without taxonomies, builds some taxonomies and then uses the algorithm 
to generalize the association rules, pruning the original rules that are generali- 
zed. Thus our algorithm always decreases or keeps the volume of the rules sets. 
The proposed algorithm and module are presented in Section 3 and 4. 




t-shirts 



shorts 



Fig. 1. An example of taxonomy for clothes. 



3 The Algorithm QA1ZT 

We analysed the structure of the association rules generated by algorithms that 
do not use taxonomies. The results of the analysis show us that it is possible 
to generalize association rules using taxonomies. In Fig. 2 we show how the 
association rules can be generalized. 

First we changed the items t-shirt and short of the rules short & slipper cap, 
sandal & short =>■ cap, sandal & t-shirt =>■ cap and slipper & t-shirt cap by the 
item light clothes (which represents a generalization) . This change generated two 
rules light clothes & slipper cap and two rules light clothes & sandal cap. 
Next, we pruned the repeated generalized rules, maintaining only the two rules: 
light clothes & slipper =>■ cap and light clothes & sandal => cap. 

The two rules generated by the Step 1 (Fig. 2) were generalized again. We 
changed the items slipper and sandal by the item light shoes (which represented 
another generalization) generating two rules light clothes & light shoes cap. 
Then we pruned the repeated generalized rules again, maintaining only one ge- 
neralized association rule: light clothes & light shoes cap. 

Due to the possibility of generalization of the association rules (Fig. 2), we 
propose an algorithm to generalize association rules. The proposed algorithm is 
illustrated in Fig. 3. We called the proposed algorithm oiQART (Generalization 
of Association Rules using Taxonomies). 
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Fig. 2. Generalization of association rules using two taxonomies. 
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Fig. 3. The proposed algorithm to generalize association rules. 



The proposed algorithm just generalizes one side of the association rules - 
LHS or RHS (after to look to a small set of rules without taxonomies, the user 
decides which side will be generalized). First, we grouped the rules in subsets 
that present equal antecedents or consequents. If the algorithm were used to 
generalize the left hand side of the rules (LHS) , the subsets would be generated 
using the equals consequents (RHS). If the algorithm were used to generalize 
the right hand side of the rules (RHS), the subsets would be generated using 
the equal antecedents (LHS). Next, we used the taxonomies to generalize each 
subset (as illustrated in Fig. 2). In the final algorithm we stored the rules in a 
set of generalized association rules. 



In the final algorithm, we also calculated the Contingency Table for each 
generalized association rules to get more information about the rules. The Con- 
tingency Table of a rule represents the coverage of the rule with respect to the 
database used in its mining [7] . With the calculation of the Contingency Table 
we finished the algorithm. 



4 The Computational Module RulEE-GAR 

In this section we present the RulEE-GAR computational module that provides 
means to generalize association rules and also to analyze the generalized rules [5] . 
The generalization of the association rules is performed by the QA1ZT algorithm, 
described in the previous section. Next we describe the means to analyze the 
generalized association rules. In Fig. 4 we showed the screen of the interface 
that enables the user to analyze and to explore the generalized rules sets. 
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Fig. 4. Screen of the analysis interface of generalized association rules. 



On the screen of the analysis interface of generalized rules (Fig. 4) there 
are some spaces where the user puts data to make a query and select a set of 
generalized rules, accompanied or not of several evaluation measures [7], to be 
analyzed. Besides allowing the user to select a set of rules, the interface provides 
four links in the section Downloads to look for and/or download the files. The 
files contain, respectively, the set of transactional data (Data Set), the set of 



source rules (Rule Set), the set of generalized rules (Generalized Rule Set) and 
the set of taxonomies used to generalize the rules ( Taxonomy Set) . 

Besides links for visualization and/or download of the files, each generalized 
association rule presents others links that enable the user to explorer information 
about the generalization of the rule. The links are positioned at the left side of 
the rules (Fig. 4). The links are described as following: 

Expanded Rule It is represented in the interface by the letter "E" . This link 
enables the user to see the generalized rule in expanded way. The generalized 
items of a rule are changed by the respective specific items. 

Source Rules It is represented in the interface by the letter "S". This link 
enables the user to see the source rules that were generalized. 

Measures It is represented in the interface by the letter "M". This link is 
available only if the user selects the support (Sup) and/or confidence (Cov) 
measures in its query and these measures present values lower than the 
minimum support and/or minimum confidence values defined to the mining 
process of the rules set not generalized. With this link it is possible to see 
which generalized rules have support and/or confidence values lower than 
the minimum support and/or minimum confidence values. 

In Fig. 4 we also see that the generalized items in a rule (items between 
parentheses) are presented as links. These links enable the user to see the source 
items that were generalized. In the analysis interface, the user can also store the 
information, selected by the query, in a text file. 

5 Experiments 

We performed some experiments using the QART algorithm to demonstrate that 
the use of taxonomies, to generalize large rules sets, reduces large quantities of 
association rules and makes easy the analysis of the rules. 

The experiments were performed using a sale database of a Brazilian super- 
market. The database contained sales data of the recent 3 month. We made 4 
partitions of the database to perform the experiments. The partitions were made 
using the sale data along of 1 day, 7 days, 14 days and 1 month. 

To generate the association rules, we used the implementation of the Apriori 
algorithm performed by Chistian Borgelt 3 with minimum support value equal 
0.5, minimum confidence value equal 0.5 and a maximum number of 5 items by 
rule. The generated rules sets are described as following: 

— RuleSet.lday - 32668 rules generated using the partition of 1 day; 

— RuleSet_7days - 19166 rules generated using the partition of 7 days; 

— RuleSet_14days - 16053 rules generated using the partition of 14 days; 

— RulcSct_lmonth - 21505 rules generated using the partition of 1 month; 

3 Available for downloading at the web site http://fuzzy.cs.uni-magdeburg.de/ 
-borgelt/software .html. 



— RulcSet_3months - 19936 rules generated using the whole database (3 months 
of sale data). 

To perform the experiments, we looked to the database and to the 5 sets 
of association rules generated to make 18 sets of taxonomies. Then we ran the 
GA1ZT algorithm combining each set of taxonomies with each set of rules. In 
Fig. 5 a chart is presented that shows the reduction rates of the 5 rules sets 
after running QA1ZT algorithm using the 18 sets of taxonomies to generalize 
each rules set. In Fig. 5 the sets of taxonomies are called "T" followed by an 
identification number, as for example: T01. 

As it can be observed in Fig. 5, the experiments show reduction rates of the 
sets of association rules varying from 14,61% to 50,11%. 




Fig. 5. Reduction rates got using taxonomies to generalize association rules. 



6 Conclusion 

A problem found in the Data Mining process is the fact that several of the 
used algorithms generate large quantities of patterns, complicating the analysis 
of the patterns. This problem occurs with the association rules, a Data Mining 
technique that tries to identify all the patterns in a database. 

The use of taxonomies, in the step of knowledge post-processing, to generalize 
and to prune uninteresting and/or redundant rules may help the user to analyze 
the generated association rules. 



In this paper we proposed the QARTf algorithm that uses taxonomies to ge- 
neralize association rules. We also proposed the RulEE-GAR computational mo- 
dule that uses the QART algorithm to generalize association rules and provides 
several means to analyse the generalized association rules. Then we presented 
the results of some experiments performed to demonstrate that the QARTf al- 
gorithm may reduce the volume of the sets of association rules. As the sets of 
taxonomies were made by the user, others sets of taxonomies may generate re- 
duction rates higher than the rates presented in our experiments, mainly whether 
the sets were made by experts in the application domain. 
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